12 research outputs found
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers
ChatGPT is a large language model developed by OpenAI. Despite its impressive
performance across various tasks, no prior work has investigated its capability
in the biomedical domain yet. To this end, this paper aims to evaluate the
performance of ChatGPT on various benchmark biomedical tasks, such as relation
extraction, document classification, question answering, and summarization. To
the best of our knowledge, this is the first work that conducts an extensive
evaluation of ChatGPT in the biomedical domain. Interestingly, our evaluation
finds that on biomedical datasets with smaller training sets, zero-shot ChatGPT
even outperforms state-of-the-art fine-tuned generative transformer models such
as BioGPT and BioBART. This suggests that ChatGPT's pre-training on large text
corpora makes it quite specialized even in the biomedical domain. Our findings
demonstrate that ChatGPT has the potential to be a valuable tool for various
tasks in the biomedical domain that lack large annotated data.
Comment: Accepted by BioNLP@ACL 2023
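The zero-shot evaluation setup described above can be sketched as follows. This is a minimal illustration only: the prompt wording, labels, and example text are hypothetical, and a real run would send the prompt to ChatGPT rather than score a hard-coded response.

```python
# Minimal sketch of zero-shot evaluation: compose an instruction prompt for a
# biomedical classification example (no in-context demonstrations), then score
# a free-form model response by normalized exact match against the gold label.

def build_prompt(document: str, labels: list[str]) -> str:
    """Compose a zero-shot classification prompt (no training examples)."""
    return (
        f"Classify the following biomedical text into one of: {', '.join(labels)}.\n"
        f"Text: {document}\n"
        "Answer with the label only."
    )

def exact_match(prediction: str, gold: str) -> bool:
    """Normalize case and whitespace, since generative output is free-form."""
    return prediction.strip().lower() == gold.strip().lower()

prompt = build_prompt(
    "The patient was treated with metformin for type 2 diabetes.",
    ["drug-disease relation", "no relation"],
)
# In a real evaluation, `prompt` would be sent to the model; here we only
# exercise the scoring step on a made-up response.
print(exact_match(" Drug-disease relation \n", "drug-disease relation"))  # → True
```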
A Comprehensive Evaluation of Large Language Models on Benchmark Biomedical Text Processing Tasks
Recently, Large Language Models (LLMs) have demonstrated impressive capability
to solve a wide range of tasks. However, despite their success across various
tasks, no prior work has investigated their capability in the biomedical domain
yet. To this end, this paper aims to evaluate the performance of LLMs on
benchmark biomedical tasks. For this purpose, we conduct a comprehensive
evaluation of 4 popular LLMs in 6 diverse biomedical tasks across 26 datasets.
To the best of our knowledge, this is the first work that conducts an extensive
evaluation and comparison of various LLMs in the biomedical domain.
Interestingly, our evaluation finds that on biomedical datasets with smaller
training sets, zero-shot LLMs even outperform the current state-of-the-art
fine-tuned biomedical models. This suggests that pre-training on large text
corpora makes LLMs quite specialized even in the biomedical domain. We also
find that no single LLM outperforms all other LLMs across tasks, as the
performance of different LLMs varies depending on the task. While their
performance is still quite poor compared to biomedical models fine-tuned on
large training sets, our findings demonstrate that LLMs have the potential to
be a valuable tool for various biomedical tasks that lack large annotated data.
Comment: Extended version of the following BioNLP paper:
https://aclanthology.org/2023.bionlp-1.30/ (arXiv:2306.04504). arXiv admin
note: substantial text overlap with arXiv:2306.04504
Improving Named Entity Recognition in Telephone Conversations via Effective Active Learning with Human in the Loop
Telephone transcription data can be very noisy due to speech recognition
errors, disfluencies, etc. Not only is annotating such data very challenging
for annotators, but the data may also contain many annotation errors even after
the annotation job is completed, resulting in poor model performance. In this
paper, we present an active learning framework that
leverages human-in-the-loop learning to identify data samples in the annotated
dataset that are more likely to contain annotation errors and select them for
re-annotation. In this way, we largely reduce the need to re-annotate the whole
dataset. We conduct extensive experiments with our proposed approach for
Named Entity Recognition and observe that by re-annotating only about 6%
training instances out of the whole dataset, the F1 score for a certain entity
type can be significantly improved by about 25%.
Comment: The final version of this paper will be published in the Proceedings
of the DaSH Workshop @ EMNLP 2022. This paper is accepted for presentation in
both DaSH@EMNLP 2022 and HiLL@NIPS 2022
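The selection step of such an active learning loop can be sketched in a few lines. This is a toy illustration, not the paper's method: it assumes per-sample model confidences are available and simply flags the lowest-confidence fraction (roughly the 6% mentioned above) for human re-annotation.

```python
# Toy sketch of uncertainty-based sample selection for re-annotation:
# rank annotated samples by the model's confidence in its own predictions
# and flag the lowest-confidence fraction for a human to re-check.

def select_for_reannotation(confidences: list[float],
                            fraction: float = 0.06) -> list[int]:
    """Return indices of the lowest-confidence samples (at least one)."""
    k = max(1, int(len(confidences) * fraction))
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:k]

scores = [0.99, 0.42, 0.95, 0.31, 0.97, 0.88]
print(select_for_reannotation(scores, fraction=0.34))  # → [3, 1]
```

The flagged indices would then go back to annotators, so only a small slice of the dataset needs a second pass.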
ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries
Automatic chart-to-text summarization is an effective tool for visually
impaired people, while also providing precise insights into tabular data in
natural language. A large and well-structured dataset is always a key
component of data-driven models. In this paper, we propose ChartSumm: a
large-scale benchmark dataset consisting of a total of 84,363 charts along with
their metadata and descriptions covering a wide range of topics and chart types
to generate short and long summaries. Extensive experiments with strong
baseline models show that even though these models generate fluent and
informative summaries, achieving decent scores in various automatic evaluation
metrics, they often suffer from hallucination, miss important data points, and
incorrectly explain complex trends in the charts. We also investigate the
potential of expanding ChartSumm to other languages using automated translation
tools. These findings make our dataset a challenging benchmark for future
research.
Comment: Accepted as a long paper at the Canadian AI 202
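A common baseline for chart-to-text is a template that verbalizes the chart's metadata; a minimal sketch is below. The field names and example values are illustrative, not the actual ChartSumm schema.

```python
# Hypothetical template baseline for chart-to-text summarization: turn chart
# metadata (title, axis labels, data points) into a one-sentence summary.

def short_summary(title: str, x_label: str, y_label: str,
                  points: list[tuple[str, float]]) -> str:
    """Verbalize the chart and report its peak data point."""
    peak = max(points, key=lambda p: p[1])
    return (f"The chart '{title}' shows {y_label} by {x_label}; "
            f"the highest value is {peak[1]:g} for {peak[0]}.")

print(short_summary("Smartphone sales", "year", "units sold (millions)",
                    [("2019", 1371.0), ("2020", 1280.0), ("2021", 1433.5)]))
```

Template baselines like this are fluent but shallow, which is exactly why learned models, and their hallucination and trend-explanation failures, matter for this benchmark.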
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets
The development of large language models (LLMs) such as ChatGPT has brought a
lot of attention recently. However, their evaluation on benchmark academic
datasets remains under-explored due to the difficulty of evaluating the
generative outputs produced by this model against the ground truth. In this
paper, we aim to present a thorough evaluation of ChatGPT's performance on
diverse academic datasets, covering tasks like question-answering, text
summarization, code generation, commonsense reasoning, mathematical
problem-solving, machine translation, bias detection, and ethical
considerations. Specifically, we evaluate ChatGPT across 140 tasks and analyze
255K responses it generates in these datasets. This makes our work the largest
evaluation of ChatGPT in NLP benchmarks. In short, our study aims to validate
the strengths and weaknesses of ChatGPT in various tasks and provide insights
for future research using LLMs. We also report a new emergent ability to follow
multi-query instructions that we mostly found in ChatGPT and other
instruction-tuned models. Our extensive evaluation shows that even though
ChatGPT is capable of performing a wide variety of tasks and may obtain
impressive performance on several benchmark datasets, it is still far from
being able to reliably solve many challenging tasks. By providing a
thorough assessment of ChatGPT's performance across diverse NLP tasks, this
paper sets the stage for a targeted deployment of ChatGPT-like LLMs in
real-world applications.
Comment: Accepted by ACL 2023 Findings. The first three authors contributed
equally.
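The multi-query instruction ability mentioned above can be illustrated with a small prompt-packing sketch. The numbered prompt and response formats here are assumptions for illustration, not a documented ChatGPT interface.

```python
# Sketch of multi-query instructions: pack several questions into one numbered
# prompt, then parse a numbered response back into per-question answers.
import re

def pack_queries(questions: list[str]) -> str:
    """Join questions into a single numbered instruction prompt."""
    lines = [f"{i}. {q}" for i, q in enumerate(questions, 1)]
    return "Answer each question, numbering your answers:\n" + "\n".join(lines)

def unpack_answers(response: str, n: int) -> list[str]:
    """Split a numbered response like '1. foo\n2. bar' into up to n answers."""
    parts = re.split(r"^\s*\d+\.\s*", response, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()][:n]

# A made-up model response, used only to exercise the parser.
print(unpack_answers("1. Paris\n2. 42", 2))  # → ['Paris', '42']
```

Batching queries this way is also what makes large-scale evaluation (hundreds of thousands of responses) tractable in practice.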
AI Coach Assist: An Automated Approach for Call Recommendation in Contact Centers for Agent Coaching
In recent years, the utilization of Artificial Intelligence (AI) in the
contact center industry is on the rise. One area where AI can have a
significant impact is in the coaching of contact center agents. By analyzing
call transcripts using Natural Language Processing (NLP) techniques, it would
be possible to quickly determine which calls are most relevant for coaching
purposes. In this paper, we present AI Coach Assist, which leverages
pre-trained transformer-based language models to determine whether a given call
is coachable or not based on the quality assurance (QA) questions asked by the
contact center managers or supervisors. The system was trained and evaluated on
contact center managers or supervisors. The system was trained and evaluated on
a large dataset collected from real-world contact centers and provides an
effective way to recommend calls to the contact center managers that are more
likely to contain coachable moments. Our experimental findings demonstrate the
potential of AI Coach Assist to improve the coaching process, thereby enhancing
the performance of contact center agents.
Comment: ACL 2023 Industry Track
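To make the task concrete, here is a toy lexical-overlap scorer standing in for the fine-tuned transformer classifier the paper actually uses: it scores a transcript against a QA question. The stop-word list and example strings are made up for the sketch.

```python
# Hypothetical stand-in for the paper's transformer classifier: score a call
# transcript against a QA question by content-word overlap. Illustration only.
import re

STOP = {"the", "a", "an", "did", "was", "is", "to", "of"}

def _words(text: str) -> set[str]:
    """Lowercase and tokenize into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def coachability_score(transcript: str, qa_question: str) -> float:
    """Fraction of the QA question's content words found in the transcript."""
    q = _words(qa_question) - STOP
    return len(q & _words(transcript)) / max(1, len(q))

print(coachability_score("agent thank you for calling how can i help",
                         "Did the agent greet the customer?"))
```

A real system would replace this heuristic with a model fine-tuned on labeled (transcript, QA question, coachable) triples, as described above.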
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
Large Language Models (LLMs) have emerged as one of the most important
breakthroughs in natural language processing (NLP) for their impressive skills
in language generation and other language-specific tasks. Though LLMs have been
evaluated in various tasks, mostly in English, they have not yet undergone
thorough evaluation in under-resourced languages such as Bengali (Bangla). In
this paper, we evaluate the performance of LLMs for the low-resourced Bangla
language. We select various important and diverse Bangla NLP tasks, such as
abstractive summarization, question answering, paraphrasing, natural language
inference, text classification, and sentiment analysis for zero-shot evaluation
with ChatGPT, LLaMA-2, and Claude-2 and compare the performance with
state-of-the-art fine-tuned models. Our experimental results demonstrate the
inferior performance of LLMs on different Bangla NLP tasks, calling for further
efforts to develop a better understanding of LLMs in low-resource languages
like Bangla.
Comment: First two authors contributed equally.
Utilizing the Transformer Architecture for Question Answering
The Question Answering (QA) task aims at building systems that can automatically answer a question or query about the given document(s). In this thesis, we utilize the transformer, a state-of-the-art neural architecture, to study two QA problems: answer sentence selection and answer summary generation. For answer sentence selection, we present two new approaches that rank a list of candidate answers for a given question by utilizing different contextualized embeddings with the transformer encoder. For answer summary generation, we study the query-focused abstractive text summarization task, which generates a natural-language summary from the source document(s) for a given query. For this task, we utilize the transformer to address the lack of large training datasets in single-document scenarios and the absence of labeled training datasets in multi-document scenarios. Based on extensive experiments, we observe that our proposed approaches obtain impressive results across several benchmark QA datasets.
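The answer sentence selection setup can be sketched as ranking candidates by vector similarity to the question. The vectors below are made up; the thesis uses contextualized embeddings from a transformer encoder instead.

```python
# Toy sketch of answer sentence selection: rank candidate sentences by cosine
# similarity between the question vector and each candidate vector.
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(q_vec: list[float],
                    cand_vecs: list[list[float]]) -> list[int]:
    """Return candidate indices sorted by similarity to the question, best first."""
    return sorted(range(len(cand_vecs)),
                  key=lambda i: cosine(q_vec, cand_vecs[i]), reverse=True)

question = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.0]]
print(rank_candidates(question, candidates))  # → [1, 2, 0]
```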
Domain Adaptation with Pre-trained Transformers for Query-Focused Abstractive Text Summarization
The Query-Focused Text Summarization (QFTS) task aims at building systems that generate the summary of the text document(s) based on the given query. A key challenge in addressing this task is the lack of large labeled data for training the summarization model. In this article, we address this challenge by exploring a series of domain adaptation techniques. Given the recent success of pre-trained transformer models in a wide range of natural language processing tasks, we utilize such models to generate abstractive summaries for the QFTS task for both single-document and multi-document scenarios. For domain adaptation, we apply a variety of techniques using pre-trained transformer-based summarization models including transfer learning, weakly supervised learning, and distant supervision. Extensive experiments on six datasets show that our proposed approach is very effective in generating abstractive summaries for
the QFTS task, while setting new state-of-the-art results on several datasets across a set of automatic and human evaluation metrics.
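The distant-supervision idea can be illustrated with a small sketch: when no labeled query-summary pairs exist, sentences that overlap the query can serve as weak extractive targets for training. The overlap heuristic and example sentences below are illustrative, not the article's actual labeling procedure.

```python
# Sketch of distant supervision for query-focused summarization: treat the
# sentences that share the most words with the query as weak summary targets.
import re

def _toks(text: str) -> set[str]:
    """Lowercase and tokenize into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def weak_summary(query: str, sentences: list[str], top_k: int = 2) -> list[str]:
    """Pick the top_k sentences sharing the most words with the query."""
    q = _toks(query)
    return sorted(sentences, key=lambda s: len(q & _toks(s)), reverse=True)[:top_k]

doc = [
    "Transformers dominate text summarization benchmarks.",
    "The weather was pleasant throughout the conference.",
    "Query-focused summarization conditions the summary on a user query.",
]
print(weak_summary("query focused summarization", doc, top_k=1))
```

Weak targets produced this way are noisy, which is why the article combines them with transfer learning and other domain-adaptation techniques.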